The structure of an XML document can be optionally specified by means of XML Schema, thus
enabling the exploitation of structural information for efficient document handling. Upon schema 
evolution, or when exchanging documents among different collections exploiting related but not 
identical schemas, the need may arise of adapting a document, known to be valid for a given schema 
S, to a target schema S' . The adaptation may require knowledge of the element semantics and cannot 
always be automatically derived. In this paper, we present an automata-based method for the static 
analysis of user-defined XML document adaptations, expressed as sequences of XQuery Update 
update primitives. The key feature of the method is the use of an automatic inference method for 
extracting the type, expressed as a Hedge Automaton, of a sequence of document updates. The 
type is computed starting from the original schema S and from rewriting rules that formally define 
the operational semantics of a sequence of document updates. Type inclusion can then be used as a
conformance test w.r.t. the type extracted from the target schema S'.



1 Introduction 

XML is a widely employed standard for the representation and exchange of data on the Web. XML 
does not define a fixed set of tags, and can thus be used in a great variety of domains. The structure of 
an XML document can be optionally specified by means of a schema, expressed as an XML Schema
[16] or as a DTD [21], and the document structural information can be exploited for efficient document
handling. A given XML schema can be used by different users to locally store documents valid for
the schema. In a dynamic and heterogeneous world such as the Web, updates to such shared schemas are
quite frequent, and support for dynamic schema management is crucial to avoid diminishing the
role of schemas in contexts characterized by highly evolving and unstable domains. As a consequence
of a schema update, document validity might need to be re-established and no automatic way to adapt 
documents to the new schema may exist, since the adaptation may require knowledge of the element 
semantics. Moreover, in case of a schema employed in different document collections, different choices 
may be taken by individual users handling different collections, depending on their specific knowledge 
of the documents in their collection. Consider for instance the case of an original schema containing 
an optional element address. The schema can be updated by inserting a zipcode sibling of address 
(optional sequence), so that now either valid documents do not contain address information at all, or, if 
an address is present, the zipcode needs to be present as well. The most obvious, automatic way to adapt
documents would be to mimic the schema update, thus inserting a zipcode for each address occurrence
in a document. However, in some cases it would be preferable to delete the address instead, thus restoring
document validity through a different operation (i.e., a deletion) that does not directly correspond to the one
applied to the schema (i.e., an insertion). Moreover, depending on the application context, only the
original schema S and the target schema S' may be known, while the update sequence that transformed
S into S' is not known. Individual users may thus specify document adaptations, intended to transform
any document valid for S into a document valid for S'. Methods able to validate the document adaptations
specified by individual users are then useful to avoid the expensive run-time revalidation of documents
resulting from the application of such adaptations.

In this paper, we present an automata-based method, called HASA (Hedge Automata Static Analyzer),
for the static analysis of XML document adaptations, expressed as sequences of XQuery Update
(XQUF) update primitives. The key feature of HASA is the use of an automatic inference method
for extracting the type of a sequence of document updates. The type is computed starting from a static
type assigned to an XML schema and from rewriting rules that formally define the operational semantics
of a sequence of document updates. Type inclusion can then be used as a conformance test w.r.t. the type
extracted from the updated XML schema. Our types are represented via Hedge Automata (HA). Hedge
Automata are a very flexible and general tool for manipulating trees: they can handle both ranked and
unranked ordered trees. Furthermore, validation algorithms for XML schemas are naturally expressed
via Hedge Automata, so it is natural to extract the type of an XML schema in the form of a Hedge
Automaton [13]. We exploit this feature in order to define a HA2HA transformation that produces the type
of a document adaptation. Specifically, HASA takes as input two XML schemas S and S' such that S' is
an evolution of (i.e., the result of a, possibly unknown, sequence U of updates on) S. For each schema,
we automatically generate the corresponding types in the form of the Hedge Automata $A$ and $A'$. The user
then provides a sequence of document updates $u_1, \ldots, u_k$ (document adaptation) to make instances of
S conform to the new schema S'. Given $A$, we compute the Hedge Automaton $A_1 = Post(u_1, A)$ that
recognizes the documents accepted by $A$ after the modification $u_1$. We then repeat the computation for
$u_2, \ldots, u_k$, producing a Hedge Automaton $A_k$ that recognizes the documents after the complete sequence
of updates. The resulting automaton $A_k$ can now be compared with the Hedge Automaton $A'$. If the language
of $A_k$ is included in that of $A'$, the proposed document adaptation surely transforms a document known to be
valid for S into a document valid for S'. If inclusion does not hold, we use the automaton $A'$ as a tester
to identify documents that do not conform to S' (i.e., testing whether the execution of the automaton $A'$
over the document corresponds to an accepting computation).

In this paper we focus our attention on the technical details underlying the design of the HASA
module. Specifically, our technical contribution is as follows. First, we introduce a parallel rewriting
semantics for modelling the effect of a document update on a term-based representation of XML documents.
Our semantics is based on a representation of document updates as special types of term rewriting
systems [11], and on a parallel semantics for modelling the simultaneous application of a rewrite rule to
each node that satisfies its enabling conditions (we consider here node selection only). As an example,
we model renaming of label a into label b as a rewrite rule $r = a(x) \to b(x)$, where x is a variable that
denotes an arbitrary list of subtrees. A document is represented as a tree t. Renaming must be applied
to all occurrences of label a in t, i.e., as a maximal parallel rewriting step computed w.r.t. r. A parallel
rewriting semantics needs to be considered, instead of the more standard sequential semantics used in
rewriting systems, to capture the semantics of more complex operations like document insertion. In the case
of document insertions, indeed, a sequential semantics may lead to incorrect rewriting steps (e.g., to
recursively modify a subtree being inserted).

We then move to the symbolic computation of types, i.e., of Hedge Automata that represent the 
effect of applying a document adaptation on the initial automaton A. More specifically, we give HA2HA 
transformations that simulate the effect of a parallel application of each type of update rules. A symbolic 
algorithm is defined to compute Post as a Hedge Automata transformation and proved correct w.r.t. our 
parallel rewrite semantics. This is the core operation of our HASA approach. Differently from other
automata-based transformation approaches [19], we are interested here in calculating the effect of a
single document update and not of its transitive closure.



G. Delzanno, G. Guerrini, A. Solimando 






Finally, a proof-of-concept implementation of the HASA module has been developed as a modification
of the LETHAL library.

The paper is organized as follows. In Section 2 some preliminary notions are introduced. Section
3 introduces Hedge Automata as a formalism to describe XML schemas, while Section 4 is devoted to
XQuery Update primitives and to the corresponding update rewrite rules, with their parallel rewriting
semantics. Section 5 describes the symbolic algorithm underlying the HASA module. Section 6 concludes
by discussing related work and future research directions.

2 Preliminaries 

In this section we introduce the notations and definitions (mainly from [6]) used in the remainder of the
work. We refer to terms and trees as synonyms. Given a string $s \in L \subseteq \Sigma^*$, the set of its prefixes
w.r.t. $L$ is defined as $Pref_L(s) = \{t \mid s = tu \wedge t, u \in L\}$. When the language is clear from the context we
use $Pref$ instead of $Pref_L$. Given a language $L \subseteq \Sigma^*$, we call prefix language the set of the prefixes of the
elements of $L$: $Prefixes(L) = \bigcup_{s \in L} Pref_L(s)$. A language $L \subseteq \Sigma^*$ is said to be prefix-closed if
$Prefixes(L) = L$, that is, if the language contains every possible prefix of every string belonging to the
language itself.

A term is an element of a ranked alphabet defined as $(\Sigma, Arity)$, where $\Sigma$ is a finite and nonempty
alphabet and $Arity : \Sigma \to \mathbb{N}$ is a function that associates a natural number, called the arity of the
symbol, with every element of $\Sigma$. The set of symbols with arity $p$ is denoted as $\Sigma_p$ (for the sake of
conciseness we will use a compact notation, e.g., $f(,,)$ is a term contained in $\Sigma_3$). $\Sigma_0$ is called the set of
constants. Let $\mathcal{X}$ be a set of variables, disjoint from $\Sigma_0$. The set $T(\Sigma, \mathcal{X})$ of the terms over $\Sigma$ and
$\mathcal{X}$ is defined as: (1) $\Sigma_0 \subseteq T(\Sigma, \mathcal{X})$, (2) $\mathcal{X} \subseteq T(\Sigma, \mathcal{X})$, (3) if $f \in \Sigma_p$, $p > 0$ and
$t_1, \ldots, t_p \in T(\Sigma, \mathcal{X})$, then $f(t_1, \ldots, t_p) \in T(\Sigma, \mathcal{X})$.
If $\mathcal{X} = \emptyset$ we use $T(\Sigma)$ for $T(\Sigma, \mathcal{X})$ and its elements are called ground terms, i.e., terms without variables.
Linear terms are the elements of $T(\Sigma, \mathcal{X})$ in which each variable occurs at most once.

A finite and ordered ranked tree $t$ over $\Sigma$ is a map from a set $Pos(t) \subseteq \mathbb{N}^*$ into a set of labels
$\Sigma$, with $Pos(t)$ having the following properties: (1) $Pos(t)$ is finite, nonempty and prefix-closed, (2)
$\forall p \in Pos(t)$, if $t(p) \in \Sigma_n$ and $n > 0$, then $\{j \mid p.j \in Pos(t)\} = \{1, \ldots, n\}$, (3) $\forall p \in Pos(t)$, if
$t(p) \in \Sigma_0 \cup \mathcal{X}$, then $\{j \mid p.j \in Pos(t)\} = \emptyset$. $Root(t) = t(\varepsilon)$ is called the root of the tree. An
unranked tree $t$ with labels belonging to a set of unranked symbols $\Sigma$ is a map $t : \mathbb{N}^* \to \Sigma$ with a domain,
denoted as $Pos(t)$, with the following properties: (1) $Pos(t)$ is finite, nonempty and prefix-closed, (2) for
every $p \in Pos(t)$, $\{j \mid p.j \in Pos(t)\} = \{1, \ldots, k\}$ for some $k \geq 0$. The set of unranked trees over $\Sigma$ is
denoted as $T(\Sigma)$. The subtree $t|_p \in T(\Sigma, \mathcal{X})$ is the subtree at position $p$ in a tree $t \in T(\Sigma, \mathcal{X})$ such that
$Pos(t|_p) = \{j \mid p.j \in Pos(t)\}$ and $\forall q \in Pos(t|_p).\ t|_p(q) = t(p.q)$.

An example of an unranked tree is t = a(b(a,c(b)),c,a(a,c)). Note that the same label can be used in
different nodes, which may have a different number of children (an arbitrary but finite value). An example
of a subtree is $t|_1 = b(a,c(b))$.
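
The definitions of positions and subtrees can be made concrete with a small sketch. It is only an illustration under our own encoding assumptions: trees are nested `(label, children)` pairs, positions are tuples of 1-based child indices (the empty tuple stands for the empty string), and the helper names `node`, `positions`, `subtree` and `show` are hypothetical, not part of the paper.

```python
from typing import Dict, Tuple

Tree = Tuple[str, tuple]      # (label, tuple of child trees)
Position = Tuple[int, ...]    # () plays the role of the empty string


def node(label: str, *children) -> Tree:
    """Build an unranked tree node with any number of children."""
    return (label, tuple(children))


def positions(t: Tree, prefix: Position = ()) -> Dict[Position, str]:
    """Return the map p -> t(p); its key set is Pos(t)."""
    label, children = t
    result = {prefix: label}
    for j, child in enumerate(children, start=1):
        result.update(positions(child, prefix + (j,)))
    return result


def subtree(t: Tree, p: Position) -> Tree:
    """Return t|_p, the subtree rooted at position p."""
    for j in p:
        t = t[1][j - 1]
    return t


def show(t: Tree) -> str:
    """Print a tree in the term notation used in the text."""
    label, children = t
    if not children:
        return label
    return f"{label}({','.join(show(c) for c in children)})"


# t = a(b(a,c(b)),c,a(a,c)), the example tree from the text
t = node("a",
         node("b", node("a"), node("c", node("b"))),
         node("c"),
         node("a", node("a"), node("c")))

print(show(subtree(t, (1,))))   # b(a,c(b))
```

The key set returned by `positions` is prefix-closed by construction, since a position is only generated after all of its prefixes.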

3 Hedge Automata (HA) and XML Documents 

Tree Automata (TA) are a natural generalization of finite-state automata to define languages over ranked
finite trees (instead of finite words). TA can naturally be used as a formal support for document validation
[14, 13]. In this setting, however, it is often more convenient to consider more general classes
of automata, like Hedge and Sheaves Automata, to manipulate both ranked and unranked trees. Indeed,
in XML documents the number of children of a node with a certain label is not fixed a priori, and different
nodes sharing the same label may have a different number of children. Hedge Automata (HA)






Automata-based Static Analysis of XML Document Adaptations 



are a suitable formal tool for reasoning on a representation of XML documents via unranked trees. HA 
are a generalization of TA because in the latter only ranked symbols are supported and the horizontal 
languages are fixed sequences of states whose length is the rank of the considered symbol. We introduce 
the main ideas underlying HA definition in what follows. 

Given an unranked tree $a(t_1, \ldots, t_n)$ where $n \geq 0$, the sequence $t_1, \ldots, t_n$ is called a hedge. For $n = 0$
we have an empty sequence, represented by the symbol $\varepsilon$. The set of hedges over $\Sigma$ is $H(\Sigma)$. Hedges
over $\Sigma$ are inductively defined in [3] as follows: the empty sequence $\varepsilon$ is a hedge; if $g$ is a hedge and
$a \in \Sigma$, then $a(g)$ is a hedge; if $g$ and $h$ are hedges, then $gh$ is a hedge. For instance, given a tree
t = a(b(a,c(b)),c,a(a,c)), the corresponding hedges having as root nodes the children of Root(t) are
b(a,c(b)), c and a(a,c).

A Nondeterministic Finite Hedge Automaton (NFHA) defined over $\Sigma$ is a tuple $M = (Q, \Sigma, Q_f, \Delta)$
where $\Sigma$ is a finite and nonempty alphabet, $Q$ is a finite set of states, $Q_f \subseteq Q$ is the set of final states,
also called accepting states, and $\Delta$ is a finite set of transition rules of the form $a(R) \to q$, where $a \in \Sigma$,
$q \in Q$ and $R \subseteq Q^*$ is a regular language over $Q$. The regular languages $R$ that appear in rules belonging
to $\Delta$ are called horizontal languages and are represented with Nondeterministic Finite Automata (NFA). The use
of regular languages allows us to consider unranked trees. For instance, $a(q^*)$ matches a node a with any
number of subtrees generated by state $q$.

A computation of $M$ over a tree $t \in T(\Sigma)$ is a tree $M\|t$ having the same domain as $t$ and for which, for
every element $p \in Pos(M\|t)$ such that $t(p) = a$ and $M\|t(p) = q$, a rule $a(R) \to q$ in $\Delta$ must exist such
that, if $p$ has $n$ successors $p.1, \ldots, p.n$ such that $M\|t(p.1) = q_1, \ldots, M\|t(p.n) = q_n$, then $q_1 \cdots q_n \in R$. If
$n = 0$ (that is, considering a leaf node) the empty string $\varepsilon$ must belong to the language $R$ of the rule to be
applied to the leaf node. A tree $t$ is said to be accepted if a computation exists in which the root node has
a label $q \in Q_f$. The accepted language for an automaton $M$, denoted as $L(M) \subseteq T(\Sigma)$, is the set of all the
trees accepted by $M$.




Figure 1: An example of tree t (left) and the computation $M\|t$ of the automaton M over t (right).
(a) Tree t representing a true Boolean formula. (b) Accepting computation of the automaton M over tree t.

As an example, consider the NFHA $M = (Q, \Sigma, Q_f, \Delta)$ where $Q = \{q_0, q_1\}$, $\Sigma = \{0, 1, not, and, or\}$,
$Q_f = \{q_1\}$ and $\Delta = \{not(q_0) \to q_1,\ not(q_1) \to q_0,\ 1(\varepsilon) \to q_1,\ 0(\varepsilon) \to q_0,\ and(Q^* q_0 Q^*) \to q_0,\ and(q_1 q_1^*) \to q_1,\ or(Q^* q_1 Q^*) \to q_1,\ or(q_0 q_0^*) \to q_0\}$.
Figure 1(a) shows a tree t representing a Boolean formula. Figure
1(b) shows the accepting computation of the automaton M (i.e., $M\|t(\varepsilon) = q_1 \in Q_f$). Note that though
and, or are binary logic operators, we used their associativity to treat them as unranked symbols. The
equivalent TA differs from the HA only in the rules for these binary operators:
$\Delta = \{\ldots,\ and(q_0, q_0) \to q_0,\ and(q_0, q_1) \to q_0,\ and(q_1, q_0) \to q_0,\ and(q_1, q_1) \to q_1,\ or(q_0, q_0) \to q_0,\ or(q_0, q_1) \to q_1,\ or(q_1, q_0) \to q_1,\ or(q_1, q_1) \to q_1,\ \ldots\}$.
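
To make the notion of computation more tangible, the following sketch evaluates the Boolean NFHA bottom-up. It is a minimal illustration under our own assumptions, not the LETHAL-based implementation: states $q_0$ and $q_1$ are encoded as the characters "0" and "1" so that horizontal languages can be written as Python regular expressions, and the input formula is a hypothetical one (the tree of Figure 1 is not reproduced here).

```python
import re
from itertools import product

# Each rule a(R) -> q is a triple (label, regex for R over {"0", "1"}, state).
RULES = [
    ("0", "", "0"),                 # 0(eps) -> q0
    ("1", "", "1"),                 # 1(eps) -> q1
    ("not", "0", "1"),              # not(q0) -> q1
    ("not", "1", "0"),              # not(q1) -> q0
    ("and", "[01]*0[01]*", "0"),    # and(Q* q0 Q*) -> q0
    ("and", "11*", "1"),            # and(q1 q1*) -> q1
    ("or", "[01]*1[01]*", "1"),     # or(Q* q1 Q*) -> q1
    ("or", "00*", "0"),             # or(q0 q0*) -> q0
]
FINAL = {"1"}                       # Q_f = {q1}


def node(label, *children):
    return (label, tuple(children))


def reachable_states(t):
    """States q such that some computation labels the root of t with q."""
    label, children = t
    child_states = [reachable_states(c) for c in children]
    result = set()
    for a, horizontal, q in RULES:
        if a != label:
            continue
        # Try every word q_1 ... q_n of states that the children can reach.
        if any(re.fullmatch(horizontal, "".join(w))
               for w in product(*child_states)):
            result.add(q)
    return result


def accepts(t):
    return bool(reachable_states(t) & FINAL)


# Hypothetical formula: or(and(1, not(0), 1), 0); "and" is used with 3 children.
formula = node("or",
               node("and", node("1"), node("not", node("0")), node("1")),
               node("0"))
print(accepts(formula))                              # True
print(accepts(node("and", node("1"), node("0"))))    # False
```

Enumerating all words of child states is exponential in the number of children; it is chosen here only because it mirrors the definition of computation directly.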

A NFHA $M = (Q, \Sigma, Q_f, \Delta)$ is said to be normalized if, for each $a \in \Sigma$, $q \in Q$, at most one rule
$a(R) \to q \in \Delta$ exists. Since string regular languages are closed under union, it is always possible to define a
normalized automaton starting from a non-normalized NFHA: every pair of rules $a(R_1) \to q$ and $a(R_2) \to q$
belonging to $\Delta$ is substituted by the equivalent rule $a(R_1 \cup R_2) \to q$.

update rule               XQUF primitive update operation
$a(x) \to b(x)$           REN
$a(x) \to p$              RPL
$a(x) \to ()$             DEL
$a(x) \to a(px)$          $INS_{first}$
$a(x) \to a(xp)$          $INS_{last}$
$a(xy) \to a(xpy)$        $INS_{into}$
$a(x) \to p\,a(x)$        $INS_{before}$
$a(x) \to a(x)\,p$        $INS_{after}$

Table 1: XQUF primitives. a and b are XML tags, p is a state of an HA, and x, y are free variables that
denote arbitrary sequences of trees.
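
The normalization step can be sketched as follows. This is an illustration only: we assume that each horizontal language is given as a Python regular expression over single-character state names, and the function name `normalize` is our own.

```python
import re
from collections import defaultdict


def normalize(rules):
    """Merge all rules a(R_i) -> q sharing label a and state q into one rule."""
    grouped = defaultdict(list)
    for label, horizontal, state in rules:
        grouped[(label, state)].append(horizontal)
    # The union R_1 U ... U R_n is expressed as a regex alternation.
    return [(label, "|".join(f"(?:{r})" for r in regexes), state)
            for (label, state), regexes in grouped.items()]


# Two rules for the pair (a, q) and one rule for the pair (b, q).
rules = [("a", "q*", "q"), ("a", "pq", "q"), ("b", "", "q")]
normalized = normalize(rules)
print(normalized)
```

After the call, each (label, state) pair occurs in exactly one rule, which is the property required of a normalized automaton.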

Given two NFHA $M_1$ and $M_2$, the inclusion test consists in checking whether $L(M_1) \subseteq L(M_2)$. It can
be reduced to the emptiness test for HA ($L(M_1) \subseteq L(M_2) \Leftrightarrow L(M_1) \cap (T(\Sigma) \setminus L(M_2)) = \emptyset$). The inclusion
test is decidable, since complement, intersection and emptiness of HA can be algorithmically executed [6].

4 XQuery Update Facility as Parallel Rewriting 

XQUF [7] is an update language for XML. Its expressions are converted into an intermediate format
called Pending Update List (PUL). In this paper we consider a formulation of PULs as a special class
of rewriting rules defined on term symbols and types (states of Hedge Automata) as suggested in [11].
More specifically, we use the set of rewriting rules defined in Table 1. The idea is as follows. Target
node selection is based on the node label only (and not on hierarchical relationships among nodes). In
Table 1, a and b are node labels, and p is an automaton state that we interpret as a type declaration (it
defines any tree accepted by state p). The supported update primitives allow for renaming an element
(REN), replacing an element and its content (RPL), deleting an element (DEL), inserting a subtree as
a first, last, or arbitrarily positioned child of an element ($INS_{first}$, $INS_{last}$, $INS_{into}$, respectively) and
inserting a subtree before or after a given element ($INS_{before}$, $INS_{after}$, respectively). According to [7],
the semantics (i.e., the actual insert position) of $INS_{into}$ is implementation dependent. In real systems, in
several cases the operation is simply not provided or it is implemented either as $INS_{first}$ or as $INS_{last}$.

To illustrate the update rules, consider for instance the rule REN $a(x) \to b(x)$. Given a tree t, the rule
must be applied to every element with label a. Indeed, x is a free variable that matches any sequence of
subtrees. If the rule is applied to element e with label a and children $t_1, \ldots, t_k$, the result of its application
is the renaming of a into b, i.e., the subterm $a(t_1, \ldots, t_k)$ is replaced by the subterm $b(t_1, \ldots, t_k)$. Consider
now the rule $INS_{first}$ defined as $a(x) \to a(px)$, where p is a type (a state of an HA). Given a
tree t, the rule must be applied to every element with label a. If the rule is applied to element e with label
a and children $t_1, \ldots, t_k$, the result of its application is the insertion of a (nondeterministically chosen)
term $t'$ of type p to the left of the current sequence of children, i.e., the subterm $a(t_1, \ldots, t_k)$ is replaced by the
subterm $a(t', t_1, \ldots, t_k)$. To model the application of an XQUF primitive rule of Table 1 to each occurrence
in a term, we define next a maximal parallel rewriting semantics denoted via the relation $\Rightarrow_r$ (formally
defined in [18]). In the previous example, $INS_{first}$ inserts a tree of type p to the left of the children of
each one of the a-nodes in the term t.

To assign a formal meaning to our rewriting system, we first define the general class of rules we 
adopt here and then we specify the semantics needed to model document adaptations. 

4.1 Parameterized Hedge Rewriting System 

Let $A = (\Sigma, Q, Q_f, \Delta)$ be an HA (whose states are used as types in the rules). A Parameterized Hedge
Rewriting System (PHRS) [11] $R/A$ is a set of hedge rewriting rules of the form $L \to R$, where $L \in H(\Sigma, \mathcal{X})$
and $R \in H(\Sigma \uplus Q, \mathcal{X})$. As in Table 1, we restrict our attention to linear rewriting rules (with a
single occurrence of each variable in the left-hand and right-hand side). In [19] and [11] the operational
semantics of update rules is sequential because it applies a single rewriting rule at each step (both the
rule and the term to which it is applied are chosen in a nondeterministic way). An XML document
update, instead, has a global effect. For instance, when renaming a label in an XML schema, all the
nodes having that label must be renamed. Such an update may be expressible through maximal steps of
sequential applications of the REN rewriting rule.

Maximal sequential rewriting is not applicable to insertion rules like $INS_{first} = a(x) \to a(px)$: sequential
applications of $INS_{first}$ may select a single target node more than once, thus yielding incorrect
results. For instance, let t = a(a(b,c),b) be the tree representation of an XML document and t' = d(e)
the tree corresponding to an XML fragment. Consider the insertion of t' into t as first child of all the
nodes labelled by a through the operation r defined as $a(x) \to a(t'x)$. If we use the standard sequential
semantics of term rewriting we need two applications of rule r, one for each node matching the
left-hand side. This leads to terms like $t_1 = a(a(d(e),d(e),b,c),b)$, $t_2 = a(d(e),d(e),a(b,c),b)$, and
$t_3 = a(d(e),a(d(e),b,c),b)$. The intended semantics of $INS_{first}$ requires r to be applied to all matching
occurrences of $a(x)$ in t, therefore only the last term corresponds to a correct transformation of t.
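
The difference between the two semantics can be reproduced with a short sketch. It is an illustration under our own encoding (trees as nested `(label, children)` pairs, positions as tuples of 1-based indices); the function names are hypothetical, and the inserted fragment is passed explicitly instead of being drawn from a type p.

```python
def node(label, *children):
    return (label, tuple(children))


def show(t):
    label, children = t
    if not children:
        return label
    return f"{label}({','.join(show(c) for c in children)})"


def ins_first_at(t, p, fragment):
    """One sequential step: insert fragment as first child of the node at p."""
    label, children = t
    if not p:
        return (label, (fragment,) + children)
    j = p[0] - 1
    rewritten = ins_first_at(children[j], p[1:], fragment)
    return (label, children[:j] + (rewritten,) + children[j + 1:])


def ins_first_parallel(t, target, fragment):
    """Parallel step: rewrite every target-labelled node of the ORIGINAL tree once."""
    label, children = t
    new_children = tuple(ins_first_parallel(c, target, fragment) for c in children)
    if label == target:
        # The fragment is added after the recursion, so it is never revisited.
        new_children = (fragment,) + new_children
    return (label, new_children)


t = node("a", node("a", node("b"), node("c")), node("b"))
d_e = node("d", node("e"))

# Two sequential steps that both pick the root yield t_2 from the text.
print(show(ins_first_at(ins_first_at(t, (), d_e), (), d_e)))  # a(d(e),d(e),a(b,c),b)
# The parallel step yields t_3, the intended result.
print(show(ins_first_parallel(t, "a", d_e)))                  # a(d(e),a(d(e),b,c),b)
```

Because the parallel step only traverses the original children, an inserted fragment that itself contains the label a is left untouched, which is exactly the behaviour a sequential semantics cannot guarantee.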

4.2 Parallel Rewriting 

In order to capture the meaning of update rules as document adaptation we introduce a new parallel
rewriting semantics for PHRS. In what follows we give the main ideas underlying the formal definition
which is presented in [18].

Given a term t and an update rule $r = L \to R$, we first identify the set of positions in the term t
that match the left-hand side L of the rule. The set of positions in t (strings of natural numbers, see
the preliminaries) is ordered according to the lexicographic ordering $<_{lex}$. $t|_p$ denotes the subtree at position
p. Let $Target(t, r)$ be the $<_{lex}$-ordered list of nodes that match the left-hand side of rule r. A substitution
is a map $\{x_1 \leftarrow t_1, \ldots, x_n \leftarrow t_n\}$ that substitutes $x_i$ with $t_i$, where $i \in [1, n]$. A substitution is extended to
terms with variables in the natural way.

A parallel rewriting step of a rule r on a tree t is defined as a transformation of t into a new term t'
obtained as follows. The tree t is visited bottom-up starting from its leaves. Every time a node $a(t)$ that
matches the rule $a(x) \to R$ via the substitution $\sigma$ is encountered, we replace $a(t)$ with $R\sigma$ and then we
move to the parent of the node.

The transformation is defined following a decreasing lexicographic ordering in $Target(t, r)$. To
process the current position, we first compute the contexts in which the rewrite step takes place (to
preserve the part of the tree that is not rewritten), and then we replace the matched left-hand side with
$R\sigma$. The $INS_{into}$ rule requires some care because the insertion position is nondeterministically selected
among the set of children of the matched node.

Now we show an example that involves the $INS_{after}$ rule, for the term t = b(c,d(c(a),a)), and the
rule $r = c(x) \to c(x)p$ where p is a type that contains at least the terms $t_2 = a(b)$ and $t_1 = a(c(a),c(a))$
as possible instances. The set of positions in t is defined as $Pos(t) = \{\varepsilon, 1, 2, 2.1, 2.2, 2.1.1\}$. The rule r
matches nodes of t at positions 1 and 2.1.

• We start from the greatest position 2.1 and compute the context $C_2$ defined by the term $b(c,d(y,a))$
(a context is obtained by replacing the subtree at position 2.1 with a fresh variable y). The substitution
$\sigma_2 = \{x \leftarrow (a)\}$ is the result of matching $c(x)$ with $c(a)$. We can now rewrite the context $C_2[y]$
as $C_2[R_2\sigma_2]$ where $R_2$ is obtained by instantiating p with term $t_2$. This gives us the intermediate
tree b(c,d(c(a),a(b),a)), in which a(b) is the new subtree.

• We now consider the position 1, extract the context $C_1 = b(y,d(c(a),a(b),a))$ and consider the
matching substitution $\sigma_1 = \{x \leftarrow \varepsilon\}$ between $c(x)$ and c. We apply the rewrite step by substituting y
with $R_1\sigma_1$ and obtain the new term $C_1[R_1\sigma_1] = b(c,a(c(a),c(a)),d(c(a),a(b),a))$, in which a(c(a),c(a)) is the
inserted subtree, that corresponds to the result of the parallel rewriting step.
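
The two steps above can be replayed with a small sketch that processes the target positions in decreasing lexicographic order. It is an illustration under our own assumptions: trees are nested `(label, children)` pairs, positions are tuples of 1-based indices (Python compares tuples lexicographically), and the nondeterministic choice of an instance of p is modelled by a hypothetical `choose` function supplied by the caller.

```python
def node(label, *children):
    return (label, tuple(children))


def show(t):
    label, children = t
    if not children:
        return label
    return f"{label}({','.join(show(c) for c in children)})"


def targets(t, label, prefix=()):
    """Positions of t whose node is labelled with the given label."""
    lab, children = t
    found = [prefix] if lab == label else []
    for j, child in enumerate(children, start=1):
        found += targets(child, label, prefix + (j,))
    return found


def insert_after(t, p, new):
    """Insert the tree `new` as right sibling of the node at position p."""
    lab, children = t
    j = p[0]
    if len(p) == 1:
        return (lab, children[:j] + (new,) + children[j:])
    rewritten = insert_after(children[j - 1], p[1:], new)
    return (lab, children[:j - 1] + (rewritten,) + children[j:])


def ins_after_parallel(t, label, choose):
    """Parallel INS_after step; targets are computed once, on the original tree."""
    for p in sorted(targets(t, label), reverse=True):
        if not p:
            raise ValueError("INS_after cannot be applied to the root")
        t = insert_after(t, p, choose(p))
    return t


t = node("b", node("c"), node("d", node("c", node("a")), node("a")))
instances = {
    (2, 1): node("a", node("b")),                                   # t_2 = a(b)
    (1,): node("a", node("c", node("a")), node("c", node("a"))),    # t_1 = a(c(a),c(a))
}
print(show(ins_after_parallel(t, "c", instances.__getitem__)))
# b(c,a(c(a),c(a)),d(c(a),a(b),a))
```

Processing positions from the greatest to the smallest keeps the remaining target positions valid, because an insertion only shifts siblings to the right of an already processed node.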

We remark that a rule with a type term like p may yield different instantiations of p in the same parallel
step (as in the previous example). The definition can be extended in a natural way to a set R of update
rules. We use $\Rightarrow_R$ to denote the resulting relation and $\Rightarrow_R^*$ to denote its transitive-reflexive closure.

Finally, we define $Post_{R/A}(S)$, where $S \subseteq T(\Sigma)$ and $R/A$ is a PHRS based on update rules, as the
language obtained by a single application of rules in the set R to each element of S through the parallel
rewriting semantics associated with update rules. When R and A are clear from the context the shorthand
$Post(S)$ is employed.

5 Hedge Automata-based Static Analysis (HASA) 

In this section we describe the symbolic algorithm underlying the HASA module. As mentioned in the
introduction, our goal is to effectively compute the effect of a document adaptation on each tree that is
accepted by a given HA $A$. For this purpose, given a fixed update rule r, we define a HA transformation from
$A$ to a new HA $A'$ such that $L(A') = \{t' \mid t \Rightarrow_r t',\ t \in L(A)\}$. In order to define such a transformation we
need to carefully operate on the vertical component of $A$ (rewriting rules that accept the node labels) as
well as on the horizontal languages (e.g., for operations like insertion). The $INS_{into}$ rule is discussed at
the end of the section. We anticipate that the nondeterminism in the choice of the insertion position may
introduce the need to consider several alternatives for the Post computation for the same rule instance.
In practical implementations this can be avoided since the semantics of the $INS_{into}$ rule is always resolved in
favour of some fixed insertion position. In what follows we provide some examples of the construction.
Due to space limitations, the correctness proof of the algorithm w.r.t. our parallel semantics is given in [18].

Given two HA $A = (\Sigma_A, P, P_f, \Theta)$ (the HA that describes the types occurring in the update rules) and
$A_L = (\Sigma_L, Q_L, Q_L^f, \Delta_L)$ (the HA that describes the structure of a set of documents) such that $A$ and $A_L$ are
normalized automata, $P \cap Q_L = \emptyset$ and $L = L(A_L)$, we define the HA $A' = (\Sigma := \Sigma_A \cup \Sigma_L,\ P \cup Q_L,\ Q_L^f,\ \Delta')$
such that $L(A') = Post_{R/A}(L)$. The transition relation $\Delta'$ is defined on top of individual laws, one for each type
of update rewriting rule.

For each $a \in \Sigma$, $q \in Q_L$, we denote with $L_{a,q}$ the horizontal language of the unique rule $a(L_{a,q}) \to q \in \Delta_L$,
accepted by the NFA $B_{a,q} = (Q_L, S_{a,q}, i_{a,q}, \{f_{a,q}\}, \Gamma_{a,q})$.

As a preliminary operation we need to expand the alphabet of each automaton that recognizes the
horizontal languages, from $Q_L$ to $P \cup Q_L$. For each of the following rules we assume $p \in P$, which allows
only hedges included in the language $L(A)$ to be inserted.

For the operations $INS_{before}$, $INS_{after}$, RPL and DEL, either some states $q \in Q_L$ involved in a change
could be shared among different symbols in $\Sigma$, or two rules $a(L_a) \to q$, $b(L_b) \to q \in \Delta_L$ could exist such
that $a \neq b$. To avoid an unwanted change for symbol b, a fresh state $q^{fresh} \notin P \cup Q_L$ is created and, for
each rule in which the label a and the state q appear simultaneously, a copy of this state is created and
q is replaced by $q^{fresh}$. As a last step, $q^{fresh}$ is added to $Q_L$ and to any other alphabet belonging to the
horizontal languages, while also updating their transitions. These changes must be applied before any
other modification.

In the following we present the modification rules for each XQUF primitive rule. 
Renaming: REN

Given the rule $a(x) \to b(x) \in R/A$, where $a, b \in \Sigma$, for each $q \in Q_L$ such that $L(B_{a,q}) \neq \emptyset$ holds,
if $L(B_{b,q}) = \emptyset$ we define $B_{b,q} := B_{a,q}$, by changing the indexes of the various elements. By contrast, if
$L(B_{b,q}) \neq \emptyset$, we define a new version of $B_{b,q}$ as the automaton that recognizes the union of $L(B_{a,q})$ and
$L(B_{b,q})$, i.e., $B_{b,q} = (Q_L,\ S_{a,q} \uplus S_{b,q} \uplus \{i_{ab,q}\},\ i_{ab,q},\ \{f_{a,q}\} \uplus \{f_{b,q}\},\ \Gamma_{a,q} \uplus \Gamma_{b,q} \uplus \{(i_{ab,q}, \varepsilon, i_{a,q}), (i_{ab,q}, \varepsilon, i_{b,q})\})$.
Finally, we remove the rules of the form $a(L_{a,q}) \to q$ from $\Delta_L$, where $q \in Q_L$, and we add the corresponding
rule $b(L_{b,q}) \to q$ for each deleted transition. On the one hand, these changes allow the automaton
to accept the label b where the old automaton accepts label a. On the other hand, they preserve the
"behaviour" of the label b in the horizontal languages in which it can be evaluated.

Insert first: $INS_{first}$

The rule $a(x) \to a(px) \in R/A$ leads to a change of the automaton $B_{a,q}$, for each $q \in Q_L$ such that $L_{a,q} \neq \emptyset$.
A fresh state $q^{fresh}_{a,q}$ such that $q^{fresh}_{a,q} \notin S_{a,q}$ is created, then it is added to $S_{a,q}$ and used as the initial state.
After that, if $\Gamma_{a,q} = \emptyset$ holds, the transition $(q^{fresh}_{a,q}, p, f_{a,q})$ is added to $\Gamma_{a,q}$. Otherwise, for each transition
of the form $(i_{a,q}, y, q_y) \in \Gamma_{a,q}$, where $i_{a,q}$ is an initial state, $y \in P \cup Q_L$ and $q_y \in S_{a,q}$, a transition of the form
$(q^{fresh}_{a,q}, p, i_{a,q})$ is added to $\Gamma_{a,q}$.

Figure 2: The changes to the horizontal automaton due to rule $INS_{first}$ are depicted as grey text and
dotted lines.
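
The effect of this construction on a single horizontal automaton can be sketched as follows. The encoding is our own assumption: an NFA is a small dataclass without epsilon transitions, state and symbol names are plain strings, and the sketch always links the fresh state to the old initial state, which covers the case of an empty transition set only when the old initial state is also the final one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NFA:
    states: frozenset
    initial: str
    finals: frozenset
    transitions: frozenset   # triples (source, symbol, target)


def accepts(nfa, word):
    """Standard subset simulation (no epsilon transitions)."""
    current = {nfa.initial}
    for symbol in word:
        current = {t for (s, y, t) in nfa.transitions
                   if s in current and y == symbol}
    return bool(current & nfa.finals)


def ins_first(nfa, p, fresh="q_fresh"):
    """Prefix every accepted word with the type symbol p via a fresh initial state."""
    assert fresh not in nfa.states, "the new state must really be fresh"
    return NFA(states=nfa.states | {fresh},
               initial=fresh,
               finals=nfa.finals,
               transitions=nfa.transitions | {(fresh, p, nfa.initial)})


# Horizontal language q* for some rule a(q*) -> q.
b_aq = NFA(frozenset({"i"}), "i", frozenset({"i"}), frozenset({("i", "q", "i")}))
updated = ins_first(b_aq, "p")

print(accepts(updated, ["p", "q", "q"]))   # True
print(accepts(updated, ["q"]))             # False
```

After the update the horizontal language is p q*, i.e., every a-node must now start with a child of type p, as required by the parallel semantics of $INS_{first}$.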




Insert last: $INS_{last}$

The rule $a(x) \to a(xp) \in R/A$ leads to a change of the automaton $B_{a,q}$, for each $q \in Q_L$ such that $L_{a,q} \neq \emptyset$. A
fresh state $q^{fresh}_{a,q}$ such that $q^{fresh}_{a,q} \notin S_{a,q}$ is created, added to $S_{a,q}$ and used as the final state. Then, if
$\Gamma_{a,q} = \emptyset$ holds, the transition $(i_{a,q}, p, q^{fresh}_{a,q})$ is added to $\Gamma_{a,q}$. Otherwise, for each transition of the form
$(q_y, y, f_{a,q}) \in \Gamma_{a,q}$, where $y \in P \cup Q_L$ and $q_y \in S_{a,q}$, a transition of the form $(f_{a,q}, p, q^{fresh}_{a,q})$ is added to $\Gamma_{a,q}$.

Figure 3: The changes to the horizontal automaton due to rule $INS_{last}$ are depicted as grey text and
dotted lines.

Insert before: $INS_{before}$

For the rule $a(x) \to p\,a(x) \in R/A$ we need to modify each horizontal language in which a state $q \in Q_L$
such that $L(B_{a,q}) \neq \emptyset$ may occur. For each $q \in Q_L$ such that $L(B_{a,q}) \neq \emptyset$ a fresh state $q^{fresh}_{a,q}$ is created
such that $q^{fresh}_{a,q} \notin S_{b,z}$, for each $b \in \Sigma$ and $z \in Q_L$. Then, $q^{fresh}_{a,q}$ is added to $S_{b,z}$ if at least one transition of
the form $(s, q, s') \in \Gamma_{b,z}$, where $s, s' \in S_{b,z}$, exists. These transitions are changed to $(s, p, q^{fresh}_{a,q})$, after that
the corresponding transitions $(q^{fresh}_{a,q}, q, s')$ are added to $\Gamma_{b,z}$.

Figure 4: The changes to the horizontal automaton due to rule $INS_{before}$ are depicted as grey text and
dotted lines.



Insert after: $INS_{after}$

For the rule $a(x) \to a(x)\,p \in R/A$ we need to modify each horizontal language in which a state $q \in Q_L$
such that $L(B_{a,q}) \neq \emptyset$ may occur. For every $q \in Q_L$ such that $L(B_{a,q}) \neq \emptyset$ a fresh state $q^{fresh}_{a,q}$ is created
such that $q^{fresh}_{a,q} \notin S_{b,z}$, for each $b \in \Sigma$ and $z \in Q_L$. This new state is added to $S_{b,z}$ if at least one transition
of the form $(s, q, s') \in \Gamma_{b,z}$, where $s, s' \in S_{b,z}$, exists. These transitions are changed into $(s, q, q^{fresh}_{a,q})$, after
that the corresponding transitions of the form $(q^{fresh}_{a,q}, p, s')$ are added to $\Gamma_{b,z}$.

Figure 5: The changes to the horizontal automaton due to rule $INS_{after}$ are depicted as grey text and
dotted lines.
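
The transition splitting used by $INS_{before}$ and $INS_{after}$ can be sketched on a plain set of transitions. This is a hedged illustration, not the paper's exact construction: as a conservative assumption of ours, the sketch allocates one fresh state per rewritten transition (the text describes a single fresh state per pair a, q), and the names `split_transitions` and `fresh0`, `fresh1`, ... are hypothetical and assumed not to clash with existing states.

```python
def split_transitions(transitions, q, first, second):
    """Replace each (s, q, s') by (s, first, fresh) and (fresh, second, s')."""
    result = set()
    for k, (s, y, t) in enumerate(sorted(transitions)):
        if y == q:
            fresh = f"fresh{k}"   # assumed not to occur among the old states
            result |= {(s, first, fresh), (fresh, second, t)}
        else:
            result.add((s, y, t))
    return result


def accepts(initial, finals, transitions, word):
    """Subset simulation of an NFA without epsilon transitions."""
    current = {initial}
    for symbol in word:
        current = {t for (s, y, t) in transitions
                   if s in current and y == symbol}
    return bool(current & finals)


# Horizontal language "z q" of some rule b(z q) -> z'.
gamma = {("s0", "z", "s1"), ("s1", "q", "s2")}

after = split_transitions(gamma, "q", "q", "p")    # INS_after: q becomes q p
before = split_transitions(gamma, "q", "p", "q")   # INS_before: q becomes p q

print(accepts("s0", {"s2"}, after, ["z", "q", "p"]))    # True
print(accepts("s0", {"s2"}, before, ["z", "p", "q"]))   # True
print(accepts("s0", {"s2"}, after, ["z", "q"]))         # False
```

The sketch presupposes that the state q is produced only by nodes labelled a, which is what the preliminary fresh-state duplication described earlier in this section is meant to guarantee.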



Replace: RPL

For the rule $a(x) \to p \in R/A$ we need to modify each horizontal language in which a state $q \in Q_L$ such
that $L(B_{a,q}) \neq \emptyset$ may occur. Each transition of the form $(s, q, s')$ included in $\Gamma_{b,z}$, where $b \in \Sigma$ and $z \in Q_L$,
is changed into $(s, p, s')$.

Figure 6: The changes to the horizontal automaton due to rule RPL are depicted as grey text and dotted
lines.

Delete: DEL 

For the rule a(x) — > () G R/A we need to modify every horizontal language in which a state q G Ql such 
that L(B a q ) / may occur. Each transition of the form (s,q,s') in r^ z , where b G £ and z G <2z.> is 
changed into (s,e,s'). 



Figure 7: The changes to the horizontal automaton due to rule DEL are depicted as grey texts and dotted 
lines. 
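
Both RPL and DEL only relabel transitions, which the following Python sketch makes explicit. It is an illustration under assumptions, not the prototype's implementation: the triple-based NFA representation, the use of `None` as the $\varepsilon$ label, all function names, and the rule $c(x) \to g_d$ used for RPL are hypothetical. The membership test follows $\varepsilon$-transitions so that the effect of DEL can be observed.

```python
EPS = None  # label standing for an epsilon transition (assumption of this sketch)


def relabel(delta, q, new_label):
    """Replace the label q by new_label on every transition of one horizontal NFA."""
    return {(s, new_label if lab == q else lab, t) for (s, lab, t) in delta}


def rpl(delta, q, p):
    """RPL, rule a(x) -> p: (s, q, s') becomes (s, p, s')."""
    return relabel(delta, q, p)


def delete(delta, q):
    """DEL, rule a(x) -> (): (s, q, s') becomes (s, epsilon, s')."""
    return relabel(delta, q, EPS)


def eps_closure(delta, states):
    closure, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for (src, lab, t) in delta:
            if src == s and lab is EPS and t not in closure:
                closure.add(t)
                stack.append(t)
    return closure


def accepts(delta, initial, finals, word):
    """NFA membership test that follows epsilon transitions."""
    current = eps_closure(delta, {initial})
    for symbol in word:
        step = {t for (s, lab, t) in delta if s in current and lab == symbol}
        current = eps_closure(delta, step)
    return bool(current & finals)


# B_{a,q_a1} from Example 1: horizontal language q_b* q_c
delta = {("p_b", "q_b", "p_b"), ("p_b", "q_c", "p_c")}
replaced = rpl(delta, "q_c", "g_d")   # language becomes q_b* g_d
deleted = delete(delta, "q_c")        # language becomes q_b*
```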

In the end, $\Delta'$ is computed as $\Delta' := \Theta \cup \{a(B_{a,q}) \to q \mid a \in \Sigma, q \in Q_L, L(B_{a,q}) \neq \emptyset\}$. The transitions of 
$\Theta$ ensure that A' is able to evaluate any subtree belonging to L, while the other transitions are used by A' for the 
evaluation of the elements of L(A) with the changes due to the update operations. The test $L(B_{a,q}) \neq \emptyset$ 
excludes unnecessary transitions. 

To preserve the tree structure of an XML document we need to avoid the application of the operations 
$INS_{before}$, $INS_{after}$ and DEL, of the form $a(x) \to p\,a(x)$, $a(x) \to a(x)\,p$ and $a(x) \to ()$, respectively, to 
any tree $t \in T(\Sigma)$ such that $t(\varepsilon) = a$, i.e., to the root of a tree. 

Insert into: $INS_{into}$ 

The simulation of the $INS_{into}$ rule requires some care. The rule inserts a subtree in a nondeterministically 
chosen position among the children of a given node. Since the position is not known in advance, we 
can only guess a state s of the horizontal automaton and replace its outgoing transitions with transitions 
passing through a fresh state. However, we may need to consider an automaton for every such state s. 
We describe next the Post construction for a given choice of s. The rule $a(xy) \to a(xpy) \in R/A$ leads to 
a change of the automaton $B_{a,q}$, for each $q \in Q_L$ such that $L(B_{a,q}) \neq \emptyset$. A fresh state $q^{fresh}_{a,q}$ such that 
$q^{fresh}_{a,q} \notin S_{a,q}$ is created and added to $S_{a,q}$. At this point, for each state $s \in S_{a,q}$ reachable from $i_{a,q}$ through 
the transitions in $\Gamma_{a,q}$, each transition of the form $(s,j,s')$ is changed into one of the form $(s,j,q^{fresh}_{a,q})$, 
where $j \in P \cup Q_L$ and $s' \in S_{a,q}$, and transitions of the form $(q^{fresh}_{a,q},p,s')$ are added to $\Gamma_{a,q}$. 

The need to guess the position in the horizontal automaton at which the fresh state is inserted 
generates several possible hedge automata for each occurrence of $INS_{into}$. However, in real implementations 
this operation often reduces to either $INS_{first}$ or $INS_{last}$. Thus, in practical cases, there is no 
need to introduce a search procedure in our HASA module. 

Figure 8: The changes to the horizontal automaton due to rule $INS_{into}$ are depicted as grey texts and 
dotted lines. 

Example 1. Suppose we have two NFHAs $A_L = (\Sigma_L, Q_L, Q_L^f, \Delta_L)$ and $A = (\Sigma, P, P_f, \Theta)$ defined as follows: 

• $\Sigma_L = \{a,b,c\}$ and $\Sigma = \{a,b,d\}$, 

• $Q_L = \{q_{a1}, q_{a2}, q_b, q_c\}$ and $P = \{g_a, g_b, g_d\}$, 

• $Q_L^f = \{q_{a1}, q_{a2}\}$ and $P_f = \{g_a\}$, 

• $\Delta_L = \{a(q_b^*) \to q_{a2},\ a(q_b^* q_c) \to q_{a1},\ b(\varepsilon) \to q_b,\ c(\varepsilon) \to q_c\}$, 

• $\Theta = \{\ldots \to g_b,\ d(\varepsilon) \to g_d\}$. 

The NFAs used for the horizontal languages of the NFHA $A_L$ are: 

• $B_{a,q_{a1}} = (Q_L, S_{a,q_{a1}} = \{p_b, p_c\}, p_b, \{p_c\}, \Gamma_{a,q_{a1}} = \{(p_b,q_b,p_b), (p_b,q_c,p_c)\})$, 

• $B_{a,q_{a2}} = (Q_L, S_{a,q_{a2}} = \{m_b\}, m_b, \{m_b\}, \Gamma_{a,q_{a2}} = \{(m_b,q_b,m_b)\})$, 

• $B_{b,q_b} = (Q_L, S_{b,q_b} = \{n\}, n, \{n\}, \Gamma_{b,q_b} = \{\})$, 

• $B_{c,q_c} = (Q_L, S_{c,q_c} = \{o\}, o, \{o\}, \Gamma_{c,q_c} = \{\})$. 
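
A bottom-up run of $A_L$ over a tree can be sketched in Python as follows, using the four horizontal NFAs listed above. The encoding is an assumption of this sketch: trees are pairs (label, list of children), NFAs are triples (initial state, final states, transitions), and the names `HORIZONTAL`, `evaluate` and `accepted` are hypothetical.

```python
# Horizontal NFAs of A_L: (label, result state) -> (initial, finals, transitions)
HORIZONTAL = {
    ("a", "q_a1"): ("p_b", {"p_c"}, {("p_b", "q_b", "p_b"), ("p_b", "q_c", "p_c")}),
    ("a", "q_a2"): ("m_b", {"m_b"}, {("m_b", "q_b", "m_b")}),
    ("b", "q_b"): ("n", {"n"}, set()),
    ("c", "q_c"): ("o", {"o"}, set()),
}
FINAL = {"q_a1", "q_a2"}  # final states of A_L


def nfa_accepts_some(initial, finals, delta, child_state_sets):
    """True if some choice of one state per child forms a word accepted by the NFA."""
    current = {initial}
    for options in child_state_sets:
        current = {t for (s, lab, t) in delta if s in current and lab in options}
    return bool(current & finals)


def evaluate(tree):
    """Return the set of states that A_L can assign to tree = (label, children)."""
    label, children = tree
    child_sets = [evaluate(child) for child in children]
    return {
        q
        for (lab, q), (initial, finals, delta) in HORIZONTAL.items()
        if lab == label and nfa_accepts_some(initial, finals, delta, child_sets)
    }


def accepted(tree):
    return bool(evaluate(tree) & FINAL)


t = ("a", [("b", []), ("c", [])])  # the tree a(bc)
```

With this encoding, `accepted(t)` holds for a(bc), a and a(bb), and fails for a(cb), in line with the horizontal languages $q_b^* q_c$ and $q_b^*$ of the a-rules.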

It is clear that $L(A_L) = \{a(bc), a(bbc), \ldots, a(b \ldots bc), \ldots, a, a(b), a(bb), \ldots, a(b \ldots b), \ldots\}$ and that 
L(A) is the set of unranked trees where the root node is labelled with a, the internal nodes are 
labelled with b and the leaves are labelled with d. Now we apply the update sequence 
$s = \{REN: b(x) \to a(x),\ INS_{first}: c(x) \to c(g_a x),\ INS_{before}: c(x) \to g_a c(x)\}$ composed of update operations of $R/A$ 
and we compute the NFHA $A' = (\Sigma \cup \Sigma_L, P \cup Q_L, Q_L^f, \Delta')$ such that $L(A') = Post_{R/A}(L)$. 

REN: $b(x) \to a(x)$; the NFA $B_{a,q_b} = (P \cup Q_L, S_{b,q_b} = \{n\}, n, \{n\}, \Gamma_{b,q_b} = \{\})$ is defined and all the 
occurrences of label b in the horizontal rules are replaced with label a. 

$INS_{first}$: $c(x) \to c(g_a x)$; the NFA $B_{c,q_c}$ is changed into $(P \cup Q_L, \{q^{fresh}_{c,q_c}, o\}, q^{fresh}_{c,q_c}, \{o\}, \{(q^{fresh}_{c,q_c}, g_a, o)\})$. 

$INS_{before}$: $c(x) \to g_a c(x)$; the NFA $B_{a,q_{a1}}$ is changed into $(P \cup Q_L, S_{a,q_{a1}} = \{p_b, p_c, q^{fresh}_{a,q_{a1}}\}, p_b, \{p_c\}, 
\Gamma_{a,q_{a1}} = \{(p_b, g_a, q^{fresh}_{a,q_{a1}}), (q^{fresh}_{a,q_{a1}}, q_c, p_c), (p_b, q_b, p_b)\})$. 



In Figure 9(a) we can see an example of application of the update operations REN, $INS_{first}$ and 
$INS_{before}$ that transform tree $t \in L$ into $t' \in L(A')$. In Figure 9(b) we can see an accepting computation 
of the NFHA A' related to tree t'. □ 

Figure 9: An example of update from tree t to t' (left) and the accepting computation of the automaton 
A' accepting the updated language over the updated tree t' (right). (a) The parallel application of REN, 
$INS_{first}$ and $INS_{before}$ changing $t \in L$ into $t' \in L(A')$. (b) The accepting computation of the NFHA A' 
over t'. 

6 Related Work and Conclusions 

We have developed a Java prototype based on the LETHAL library¹. The experiments started from the 
XML benchmark used in the XMark Benchmark Project². We tested the complete set of update primitives, 
both in isolation and in a sequence of updates. The modified schema, which is intended to be obtained 
from a schema update sequence, is manually generated. A valid (resp. invalid) sequence of document 
updates is tested by means of our symbolic computation and by means of the inclusion test for HA provided by 
the library. The Post algorithm works on a representation of horizontal languages as regular expressions 
(we adapted our algorithm to deal with it) and then computes a new HA. This is due to limitations of the 
LETHAL library, which is not designed for low-level manipulations of automata but only for the 
application of common HA operations (inclusion, union, intersection, etc.). Although the inclusion test 
for NFHA is ExpTime-complete [6], the execution times of the Post computation and of the inclusion 
test on the considered XML benchmark are negligible (less than 1s) even with a naive implementation. 
These results are not surprising because the automaton size depends on the corresponding schema size, 
which is usually limited (in terms of labels and productions) in practical schemas. In addition, schema size 
is not comparable with that of the associated document collection (in terms of number and size of 
documents). The results show the potential of our proposal for practical use as a support for static analysis 
of XML updates. Before addressing possible extensions, we discuss next some related work. 

1 LETHAL is available at http://lethal.sourceforge.net/ 

2 The benchmark and related schema are available at http://www.xml-benchmark.org/ 
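
The idea behind using type inclusion as a conformance test can be illustrated on plain word automata. The Python sketch below decides $L(A) \subseteq L(B)$ for two NFAs by exploring pairs made of a state of A and a set of states of B. It is a simplified analogue only: it is not the hedge-automata inclusion test of the LETHAL library (whose API is not shown in this paper), and the NFA encoding and the name `included` are assumptions of the sketch. The subsets of B's states are the source of the exponential worst-case cost of this procedure.

```python
from collections import deque


def included(a, b, alphabet):
    """Decide L(a) <= L(b) for NFAs given as (initial, finals, transitions)."""
    (ia, fa, da), (ib, fb, db) = a, b
    start = (ia, frozenset({ib}))
    seen, queue = {start}, deque([start])
    while queue:
        s, subset = queue.popleft()
        if s in fa and not (subset & fb):
            return False  # a accepts a word that b rejects
        for x in alphabet:
            # All states b can be in after additionally reading x.
            next_subset = frozenset(
                t for (u, lab, t) in db if u in subset and lab == x
            )
            for (u, lab, t) in da:
                if u == s and lab == x and (t, next_subset) not in seen:
                    seen.add((t, next_subset))
                    queue.append((t, next_subset))
    return True


alphabet = {"q_b", "q_c"}
# source: q_b* q_c (the automaton B_{a,q_a1} of Example 1)
source = ("p_b", {"p_c"}, {("p_b", "q_b", "p_b"), ("p_b", "q_c", "p_c")})
# target: (q_b | q_c)*, a hypothetical, more permissive content model
target = ("s", {"s"}, {("s", "q_b", "s"), ("s", "q_c", "s")})
```

Here `included(source, target, alphabet)` holds, while the converse inclusion fails because the target accepts the empty word and the source does not.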

Concerning related work on static analysis, the main formalization of schema updates is represented 
by [1], where the authors take into account a subset of XQUF which deals with structural conditions 
imposed by tags only. Type inference for XQUF, without approximations, is not always possible. This 
follows from the fact that the modifications that can be produced using this language can lead to nonregular 
schemas, which cannot be captured with existing schema languages for XML. This is the reason why [1], 
as well as [19], computes an over-approximation of the type set resulting from the updates. In our work, 
on the contrary, to produce an exact computation we were forced to cover a smaller subset of XQuery 
Update features: [1], indeed, allows the use of XPath axes to query and select nodes, allowing selectivity 
conditions to be mixed with positional constraints in the request that a given pattern must satisfy. In our 
work, as well as in [19] and in [11], we have considered update primitives only, thus excluding complex 
expressions such as "for loops" and "if statements", based on the result of a query. These expressions, 
anyway, can be translated into a sequence of primitive operations: an expression using a "for loop", for 
instance, repeats n times a certain primitive operation, and therefore can be simulated with a sequence of 
n instances of that single primitive operation³. However, tests for loops and conditional statements based 
on query results over documents are of course not expressible working only at schema level. Macro 
Tree Transducers (MTT) [12] can also be applied to model XML updates as in the Transformation 
Language (TL), based on Monadic Second-Order logic (MSO). TL not only generalizes XPath, 
XQuery and XSLT, but can also be simulated using macro tree transducers. The composition of MTTs and 
the fact that their inverses preserve recognizability are exploited to perform 
inverse type inference: the pre-image of ill-formed output is pre-computed in this way, and type checking 
is performed simply by testing whether the input type has some intersection with the pre-image. Their 
system, like ours, is exact and does not approximate the computation but, in contrast to our method, there 
is a potential implementation problem (i.e., an exponential blow-up) in the translation of MSO patterns 
into equivalent finite automata, on top of which most of their system is developed, even if MSO is not 
the only suitable pattern language that can be used with their system. Thus, our more specific approach, 
focused on a specific set of transformations, allows for a simpler (and more efficient) implementation. 

3 The interested reader could refer to [1] (Section "Semantics"), where a translation of XQUF update expressions into a 
pending update list, made only of primitive operations, is provided, according to the W3C specification jT). 

Our approach complements work on XML schema evolution developed in the XML Schema 
context JUm, where validity-preserving schema updates and, when possible, automatic adaptations are 
identified. In case no automatic adaptation can be identified, the use of user-defined adaptations is 
proposed, but then a run-time (incremental) revalidation of all the adapted documents is needed. 
Similarly, in [[H) a unifying framework for determining the effects of XML Schema evolution both on the 
validity of documents and on queries is proposed. The proposed system analyzes various scenarios in 
which forward/backward compatibility of schemas is broken. In [17] a related but different problem is 
addressed: how to exploit the knowledge that a given document is valid with respect to a schema S to 
(efficiently) assess its validity with respect to a different schema S'. Finally, document update 
transformation is addressed in [2], which investigates how to rewrite (document) updates specified on a view 
into (document) updates on the source documents, given the XML view definition. 

The present work can be extended along several directions. Node selection constraints for update 
operations could be refined, for example using XPath axes [10] and the other features offered by XQUF. 
It may be interesting to integrate the existing Java prototype of the framework with XML schema 
evolution tools like EXup [4]. Moreover, when the schema update operation sequence is known, a heuristic 
could be devised to automatically extract a sequence of update operations that ensures document validity 
with respect to the new schema, relieving the user from specifying the appropriate sequence and 
generalizing the automatic adaptation approach currently supported in EXup. Finally, support for 
commutative trees, in which the order of the children of a node is irrelevant, could be added. This feature 
would allow the formalization of the all and interleave constructs of XML Schema [16] and Relax NG 
[5, 20], respectively, and would remove the need to consider several alternative automata for the 
$INS_{into}$ operation. Sheaves Automata, introduced in [22], are able to recognize commutative trees and 
have an expressiveness strictly greater than the HA considered in this work. The applicability of these 
automata in our framework needs to be investigated. 